A Component Histogram Map Based Text Similarity Detection Algorithm
نویسندگان
چکیده
The conventional text similarity detection usually use word frequency vectors to represent texts. But it is high-dimensional and sparse. So in this research, a new text similarity detection algorithm using component histogram map (CHM-TSD) is proposed.This method is based on the mathematical expression of Chinese characters, with which Chinese characters can be split into components. Then each components occurrence frequency will be counted for building the component histogram map (CHM) in a text as text characteristic vector. Four distance formulas are used to find which the best distance formula in text similarity detection is. The experiment results indicate that CHM-TSD achieves a better precision, recall and F1 than cosine theorem and Jaccard coefficient.
منابع مشابه
Character Localization From Natural Images Using Nearest Neighbours Approach
Scene text contains significant and beneficial information. Extraction and localization of scene text is used in many applications. In this paper, we propose a connected component based method to extract text from natural images. The proposed method uses color space processing. Histogram analysis and geometrical properties are used for edge detection. Character recognition is done through OCR w...
متن کاملDetection of Fake Accounts in Social Networks Based on One Class Classification
Detection of fake accounts on social networks is a challenging process. The previous methods in identification of fake accounts have not considered the strength of the users’ communications, hence reducing their efficiency. In this work, we are going to present a detection method based on the users’ similarities considering the network communications of the users. In the first step, similarity ...
متن کاملImage Similarity Measure using Color Histogram, Color Coherence Vector, and Sobel Method
Image Retrieval means searching, browsing, and retrieving the images from an image database. Two different methods are used for image retrieval, namely text based image retrieval and content based image retrieval techniques. But now Text based search technique is old one. In Content Based Image Retrieval many visual feature like color, shape, and texture are extracted, next when we query an ima...
متن کاملUncertainty Modeling of a Group Tourism Recommendation System Based on Pearson Similarity Criteria, Bayesian Network and Self-Organizing Map Clustering Algorithm
Group tourism is one of the most important tasks in tourist recommender systems. These systems, despite of the potential contradictions among the group's tastes, seek to provide joint suggestions to all members of the group, and propose recommendations that would allow the satisfaction of a group of users rather than individual user satisfaction. Another issue that has received less attention i...
متن کاملTempo Extraction using Beat Histograms
This abstract describes the tempo extraction algorithm used for the University of Victoria submission to the MIREX (Music Information Retrieval Exchange) 2005. The algorithm is mostly based on self-similarity rather than onset detection. However, an onset detection component is used to calculate the phase of the dominant periodicities. Multiple frequency bands are calculated using a Discrete Wa...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- I. J. Network Security
دوره 17 شماره
صفحات -
تاریخ انتشار 2015